Russian Learner Parallel Corpus as a tool for translation studies (paper for the Dialogue conference)
Introduction

This paper presents the Learner Translator Corpus project, which is currently being developed by a group of Tyumen-based linguists. We discuss the feasibility of such a corpus and its existing analogues, describe the current state of corpus building, and outline the tasks the corpus could help accomplish.

Outline of the problem and preceding research

Corpus linguistics has long enjoyed the justified attention of translation studies, as the existence of functioning translation corpora shows. The cooperation between the two fields has produced a wide variety of translational corpus types. At present there are monolingual translational corpora (e.g., the Translational English Corpus by Mona Baker) and multilingual aligned corpora of texts produced by professional translators (e.g., the parallel sub-corpus of the Russian National Corpus). However, the current range of corpora clearly lacks multilingual translational corpora compiled from non-professional translations. The usefulness of such a corpus is beyond doubt, since translation mistakes are of interest to researchers in translation studies and psycholinguistics, as well as to teaching methodologists and teachers.

Such a corpus is not a brand new idea. Since 2004 the Department of Applied Linguistics of the Ulyanovsk State Technical University (Russia) has carried out a similar project (Sosnina 2005). Unfortunately, that corpus is not available on-line, so outside researchers cannot use it; moreover, 1 million word tokens is not enough for general conclusions. One can also recall the Comparable Learner Translator Corpus (cf. Kübler 2008), but it does not include Russian texts.
Aims of the project

Thus, the aims of the Learner Translator Corpus project are not only to compile a corpus of imperfect translations aligned with their source texts, large enough to give reliable data on the mistakes of non-professional translators, but also to provide easy access to it via the Internet for all interested researchers, whose potential range is quite wide. Our corpus is intended to contain 10 million word tokens across English and Russian texts. Each text exists as a source and a translation (or 'one-to-many': several translations of the same source).

Where can one find imperfect translations? Of course, every translator makes mistakes from time to time, but their number is unstable and they can in fact be quite rare. We therefore concluded that the only rational source of raw material is students' translations. They regularly contain mistakes, they already exist in digital form, and they are comparatively easy to obtain.

What can one do with this corpus? Research based on translation corpora seems to be underrepresented at computational linguistics forums such as Dialogue. At the same time, the study of translated texts versus originals yields valuable results for computer-aided monolingual and contrastive research. A new dimension can be added to such studies by analyzing non-professional, 'imperfect' translations on the basis of the adequate and representative corpus that our project aims to build. The Learner Translator Corpus enables the researcher to find out: 1) whether students follow the recognized regularities of English-Russian translation, and also to establish such regularities on the basis of the “negative” material the Corpus provides.
The analysis of a variety of imperfect translations against their originals can 2) show which of the chosen strategies can be considered erroneous because they distort the pragmatic and semantic planes of the source text and 3) make it possible to study the psycholinguistic factors that influence this choice. In addition, comparative analysis of several versions of the same text helps to see 4) which mistakes are relevant to the translation itself and which can be dismissed as formal linguistic faults with no detrimental effect on the translation as a whole. With any luck, this analysis can be formalized on the basis of frequencies in translation memory-like systems. For example, mistakes in verb aspect choice (or noun case, or subject-predicate agreement) in Russian translations usually do not corrupt semantic equivalence to the source. At the same time, lexical mistakes in translating connotations or metaphors change the tonality of the text and distort its pragmatics. Similarly, modeling syntax on foreign patterns often damages the semantics of the text. 5) Analysis of imperfect translations can be used to discover translational universals, or translation units that pose no problems in a certain language combination and a certain text type.

Such analysis is valuable for both Translation Studies and Computational Linguistics. Within the former it helps: 6) to improve methods of quality assurance and translation evaluation, thus approaching the much-discussed issue of translation quality, 7) to develop typologies of translation mistakes applicable to different text types, 8) to draw didactic conclusions and to target, in training, the mistakes typical of Russian translation students, including Russian-language mistakes induced by source texts and revealed through analysis of the translation sub-corpus.
It also seems possible to use the Corpus: 9) to identify, through sentiment analysis, the categories of mistakes that distort the textual and interpersonal meta-functions of the source.

Within Computational Linguistics: 10) the Corpus can provide material for comparative analysis of the mistakes typical of machine and human translation, allowing conclusions about the differences in the processes involved. 11) On the machine-translation side, the “negative” material of the Corpus (generalizations about wrong translations of certain language units for the given language combination) could be used in developing software. Comparative frequencies of corresponding words in the two languages and in translated texts can also give useful insights for evaluating and improving machine translation systems. 12) The Corpus can be used to test or develop algorithms for evaluating the quality of translated text, such as BLEU (Bilingual Evaluation Understudy) (Papineni et al. 2002). 13) The accessibility of the Corpus via the Internet makes it possible to verify findings based on it and to offer other explanations by appealing to parameters that may have been downplayed or ignored in previous studies (Baker 2004).

Quantitative and qualitative data about the Corpus

At present the collected corpus (available on-line in raw form - http://tc.utmn.ru/files/trc.zip) consists of 1033 texts, English and Russian, sources and corresponding translations. Exact numbers are given in the table below. These texts contain 352792 word tokens, 52% of them English and 48% Russian. We continue to add texts and will do so until we reach 1 million tokens. We then plan to pause and review all the feedback received from our users by that point. After that the next stage of the project will start, with the aim of 10 million word tokens.
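The language shares reported above could be recomputed from the raw text files roughly as follows. This is only a minimal sketch, not the project's actual tooling: the tokenization rule, the script-based language test (Cyrillic letters mean Russian, otherwise English), and the function names `count_tokens` and `language_shares` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenization: a word token is a run of Latin or Cyrillic letters,
# optionally joined by internal apostrophes or hyphens (e.g. "on-line").
TOKEN_RE = re.compile(r"[A-Za-zА-Яа-яЁё]+(?:['\-][A-Za-zА-Яа-яЁё]+)*")
CYRILLIC_RE = re.compile(r"[А-Яа-яЁё]")


def token_language(token):
    """Assumed heuristic: any Cyrillic letter marks the token as Russian."""
    return "ru" if CYRILLIC_RE.search(token) else "en"


def count_tokens(texts):
    """Count word tokens per language over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(text):
            counts[token_language(token)] += 1
    return counts


def language_shares(counts):
    """Convert per-language counts into fractions of the total token count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


if __name__ == "__main__":
    # Toy input standing in for the contents of the corpus text files.
    sample = ["The cat sat.", "Кошка сидела на окне."]
    counts = count_tokens(sample)
    print(dict(counts), language_shares(counts))
```

A different tokenizer (for instance, one that counts numbers or splits hyphenated words) would give somewhat different totals, so figures obtained this way are comparable only when the same rule is applied to every text.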
We are going to let volunteers add source and translated texts on-line, thus employing the power of crowd-sourcing. However, we will still strive to preserve the present ratio of English and Russian translations. So far we have collected texts from Tyumen State University (about 900 texts) and Udmurt State University (about 100 texts). Texts from Chelyabinsk State University and Nizhny Novgorod State Linguistic University are still being processed.

Technically, the Corpus is organized as plain text files (which will be POS-tagged and syntactically marked up) and header files, following the standards of the Translational English Corpus (http://www.llc.manchester.ac.uk/ctis/research/english-corpus). Header files contain annotations for the corresponding text files. The exact fields are:
# Translator's sex
# Year of study
# Mark received for this translation
# Draft or final version
# Genre of the text
# Type of translation (exam or routine work)
# Situation of translation (home or classroom)
# Year of performing translation
# Section of corpus (whether the text is a source or a translation)
# University

We plan to use the TEC Browser tool (http://modnlp.berlios.de) to allow users to restrict their searches to particular sub-corpora. For example, one could look for translations of a certain phrase made by male students of Tyumen State University in 2009 and compare them with translations made by female students.

Further research

The Learner Translator Corpus is still under development, and within the next few months we will face the following tasks:
# Increase the corpus volume while keeping its structure intact.
# Develop a graphical user interface available on-line. It would be based on the TEC Browser client, but we need to seriously expand its functionality (at present it cannot work with aligned texts).
# Provide morphological and syntactic mark-up to make it possible to use the corpus in grammatical contrastive and translation research.
# It is ambitious but not altogether impossible to supply a small part of the corpus with descriptive linguistic mark-up in XML. Such mark-up can be based on the most basic and universally recognized types of mistakes, such as form mistakes and content mistakes, distinguished by the degree to which they interfere with rendering the source message.
# One can go further and create a GUI that allows a reviser (such as a teacher or a critic) to mark mistakes electronically. A revised copy of a student's translation would then be easy to assess automatically and would also serve as ready-made material for further extension of the Learner Translator Corpus.

References

Baker M. A corpus-based view of similarity and difference in translation. International Journal of Corpus Linguistics, Vol. 9, No. 2, 2004, pp. 167-193.

Kübler N. A Comparable Learner Translator Corpus: creation and use. Proceedings of the Comparable Corpora Workshop of the LREC Conference, May 31, 2008, Marrakech, Morocco, pp. 73-78.

Papineni K., Roukos S., Ward T., Zhu W. BLEU: a method for automatic evaluation of machine translation. ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 311-318.

Sosnina E. Russian Translation Learner Corpus: The First Insights. Proceedings of the 6th International Scientific Conference «Interactive systems: problems of human-computer interaction». Ulyanovsk: UlSTU, 2005.